Read in (again) the gambling dataset:
gambling.data <- read.csv(file = "http://data.justice.qld.gov.au/JSD/OLGR/20170817_OLGR_LGA-EGM-data.csv",
header = TRUE,
sep = ",",
stringsAsFactors = FALSE)
# rename columns
names(gambling.data)[2] <- "Local.Govt.Area"
names(gambling.data)[7] <- "Player.Money.Lost"
names(gambling.data)[1] <- "Month.Year"
gambling.data <- na.omit(gambling.data)
# Add a day of month (the 1st) to each date string
date.string <- paste0("1 ", gambling.data$Month.Year)
# Convert to POSIXlt, a date-time format
gambling.data$Date <- strptime(date.string, format = "%d %B %Y")
# subset to Brisbane only
brisbane.only <- gambling.data[gambling.data$Local.Govt.Area=="BRISBANE",]
# Choose only the brisbane data from 2010
#row.indicies <- (brisbane.only$Date>="2010-01-01 AEST" &
# brisbane.only$Date<="2010-12-31 AEST")
#brisbane.2010.data <- brisbane.only[row.indicies,]
One of the really cool things about R, as opposed to other languages like Python or MATLAB, is that it has such a wide user base in the statistics community, and many packages have been developed to analyse data with both established and cutting-edge statistical techniques and models. Most of these (but not all) are available on CRAN. Although we won't cover much beyond the basics today (which could just as easily be done in most other languages), more advanced things like Bayesian techniques are also well supported in R.
Measures of Central Tendency
All the standard summary statistics are available in R including:
x <- c(1,3,5,2,9,NA,7,10,15,4,7,7)
x
## [1] 1 3 5 2 9 NA 7 10 15 4 7 7
mean(x)
## [1] NA
If there are missing values, we can ignore them by setting na.rm=TRUE:
mean(x, na.rm=TRUE)
## [1] 6.363636
median(x,na.rm = T)
## [1] 7
var(x, na.rm = T) # variance
## [1] 16.25455
sd(x, na.rm = T) # standard deviation
## [1] 4.031693
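One common measure of central tendency that base R does not provide is the statistical mode (the base function mode() reports an object's storage type instead). A minimal helper, with a name of our own choosing, could look like this:

```r
# Most frequent value in a vector, ignoring NAs.
# (Illustrative helper; the name most_common is ours, not base R.)
most_common <- function(v) {
  v <- v[!is.na(v)]
  tab <- table(v)
  as.numeric(names(tab)[which.max(tab)])
}

most_common(c(1, 3, 5, 2, 9, NA, 7, 10, 15, 4, 7, 7))  # 7 occurs three times
```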
In R, you can use the function quantile() to get the median and quartiles, or the function summary() to get the quantiles as well as the mean and the number of NA values:
quantile(x, na.rm = T)
## 0% 25% 50% 75% 100%
## 1.0 3.5 7.0 8.0 15.0
# mean,median,25th and 75th quartiles,min,max
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 3.500 7.000 6.364 8.000 15.000 1
You can get most of the same summary statistics visually from a boxplot (in base R graphics):
boxplot(x, horizontal = TRUE)
Or alternatively in ggplot:
library(ggplot2)
some.LGAs <- sample(unique(gambling.data$Local.Govt.Area), 5)  # 5 distinct LGAs
subset.data<-gambling.data[gambling.data$Local.Govt.Area %in% some.LGAs,]
ggplot(data=subset.data,
aes(x=Local.Govt.Area,y=Approved.EGMs))+
geom_boxplot()+
scale_y_log10()+
coord_flip()+
theme_minimal()
We can also look at the histogram of the counts of the data. To obtain a histogram, the scale of the variable is divided into consecutive intervals and the number of observations in each interval is counted.
hist(x)
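The bar heights in a histogram are just bin counts, so we can reproduce the idea manually with cut() and table(). A sketch (the break points here are our own choice, not hist()'s defaults):

```r
x <- c(1, 3, 5, 2, 9, NA, 7, 10, 15, 4, 7, 7)
# Divide the scale of the variable into consecutive intervals...
bins <- cut(x, breaks = seq(0, 16, by = 2))
# ...and count the number of observations in each interval
table(bins)
```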
Or plot a histogram in ggplot:
ggplot(data=subset.data,
aes(x=Approved.EGMs,fill=Local.Govt.Area))+
geom_histogram()+
scale_x_log10()+
theme_minimal()
Correlation measures the strength of the linear association between two variables. Identical (or deterministically related) vectors are perfectly correlated:
x1 <- sin(1:100/3)
y1 <- sin(1:100/3)
plot(x1, y1)
cor(x1, y1)
## [1] 1
Two independent random samples, by contrast, show only a small, spurious correlation:
x2 <- rnorm(100)
y2 <- rnorm(100)
plot(x2, y2)
cor(x2, y2)
## [1] -0.1854688
Calculating correlations for many variables at once
#install.packages("corrplot")
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.4.2
## corrplot 0.84 loaded
MoneyDateLGA<-gambling.data[,c("Player.Money.Lost","Date","Local.Govt.Area")]
spreaddata<-tidyr::spread(MoneyDateLGA,3,1)
#remove the columns with NA values
spreaddata$TORRES<-NULL
spreaddata$LONGREACH<-NULL
spreaddata$BALONNE<-NULL
#plot corrplot
corrplot(cor(spreaddata[,-1]),
order="hclust",
tl.cex = .5,
tl.col = "black")
#install.packages("GGally")
library(GGally)
ggpairs(gambling.data[,-c(1,2)])
There are other functions and packages to obtain summary statistics of the datasets:
stat.desc() in the pastecs package
describe() in the Hmisc package
describe() in the psych package
…
Simple linear regression is useful for examining or modelling the relationship between two variables.
ggplot(data=gambling.data,
aes(x=Operational.EGMs,
y=Player.Money.Lost,
color=Local.Govt.Area,
group=1))+
guides(color=FALSE)+
geom_point()+
geom_smooth(method = "lm",se=TRUE)
Fit a linear regression and look at a summary of its results. The model is of the form \(y = mx + c\), where \(y\) is Player.Money.Lost and \(x\) is Operational.EGMs.
dollars_per_machine_model <- lm(Player.Money.Lost ~ Operational.EGMs,
data = gambling.data)
summary(dollars_per_machine_model)
##
## Call:
## lm(formula = Player.Money.Lost ~ Operational.EGMs, data = gambling.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9146097 -246066 408 153205 11721046
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.724e+05 1.557e+04 -17.49 <2e-16 ***
## Operational.EGMs 4.152e+03 8.386e+00 495.16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1083000 on 6673 degrees of freedom
## Multiple R-squared: 0.9735, Adjusted R-squared: 0.9735
## F-statistic: 2.452e+05 on 1 and 6673 DF, p-value: < 2.2e-16
attributes(dollars_per_machine_model)
## $names
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
##
## $class
## [1] "lm"
dollars_per_machine_model$coefficients
## (Intercept) Operational.EGMs
## -272416.501 4152.383
confint(dollars_per_machine_model)
## 2.5 % 97.5 %
## (Intercept) -302943.761 -241889.241
## Operational.EGMs 4135.944 4168.822
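The fitted coefficients can be turned into predictions for new x values with predict(). A self-contained sketch on synthetic data (the gambling model above works the same way; the toy data and names here are our own):

```r
set.seed(42)
toy <- data.frame(x = 1:50)
toy$y <- 3 * toy$x + 10 + rnorm(50, sd = 0.5)  # true line: y = 3x + 10

fit <- lm(y ~ x, data = toy)
# predict() computes coef(fit)[1] + coef(fit)[2] * x for each new x
predict(fit, newdata = data.frame(x = c(10, 20)))
```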
anova(dollars_per_machine_model)
## Analysis of Variance Table
##
## Response: Player.Money.Lost
## Df Sum Sq Mean Sq F value Pr(>F)
## Operational.EGMs 1 2.8758e+17 2.8758e+17 245187 < 2.2e-16 ***
## Residuals 6673 7.8268e+15 1.1729e+12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# visualize residuals and fitted values.
plot(dollars_per_machine_model,
pch=16,
which=1)
hist(dollars_per_machine_model$residuals,
breaks=1000,
xlim=c(-2e6,2e6))
sd(dollars_per_machine_model$residuals)
## [1] 1082929
We can make a much better model by including more variables. The functions we will use are exactly the same though.
dollars_model <- lm(Player.Money.Lost ~ Operational.EGMs + Operational.Sites + as.numeric(Date) + Local.Govt.Area,
data = gambling.data)
summary(dollars_model)
##
## Call:
## lm(formula = Player.Money.Lost ~ Operational.EGMs + Operational.Sites +
## as.numeric(Date) + Local.Govt.Area, data = gambling.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5480600 -179541 -11605 168530 6292338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.375e+06 1.315e+05 10.462 < 2e-16
## Operational.EGMs 4.950e+03 1.338e+02 37.005 < 2e-16
## Operational.Sites -2.200e+05 3.490e+03 -63.027 < 2e-16
## as.numeric(Date) -2.966e-04 8.724e-05 -3.400 0.000677
## Local.Govt.AreaBANANA 8.594e+05 8.058e+04 10.665 < 2e-16
## Local.Govt.AreaBRISBANE 3.862e+07 1.211e+06 31.896 < 2e-16
## Local.Govt.AreaBUNDABERG 6.010e+06 1.807e+05 33.263 < 2e-16
## Local.Govt.AreaBURDEKIN 2.324e+05 7.930e+04 2.931 0.003393
## Local.Govt.AreaCAIRNS 6.187e+06 2.298e+05 26.920 < 2e-16
## Local.Govt.AreaCASSOWARY COAST 1.542e+06 8.892e+04 17.341 < 2e-16
## Local.Govt.AreaCENTRAL HIGHLANDS 2.336e+06 9.475e+04 24.659 < 2e-16
## Local.Govt.AreaCHARTERS TOWERS 6.474e+05 7.995e+04 8.097 6.63e-16
## Local.Govt.AreaCLONCURRY 1.948e+04 7.843e+04 0.248 0.803842
## Local.Govt.AreaDOUGLAS 6.988e+05 7.950e+04 8.791 < 2e-16
## Local.Govt.AreaFRASER COAST 5.469e+06 1.806e+05 30.288 < 2e-16
## Local.Govt.AreaGLADSTONE 3.616e+06 1.114e+05 32.456 < 2e-16
## Local.Govt.AreaGOLD COAST 2.261e+07 7.475e+05 30.240 < 2e-16
## Local.Govt.AreaGOONDIWINDI 7.911e+05 8.018e+04 9.866 < 2e-16
## Local.Govt.AreaGYMPIE 2.494e+06 9.476e+04 26.321 < 2e-16
## Local.Govt.AreaHINCHINBROOK 5.139e+05 7.875e+04 6.526 7.25e-11
## Local.Govt.AreaIPSWICH 6.552e+06 2.098e+05 31.231 < 2e-16
## Local.Govt.AreaISAAC 2.219e+06 8.869e+04 25.021 < 2e-16
## Local.Govt.AreaLIVINGSTONE 1.182e+06 8.404e+04 14.060 < 2e-16
## Local.Govt.AreaLOCKYER VALLEY 2.386e+06 9.001e+04 26.510 < 2e-16
## Local.Govt.AreaLOGAN 8.916e+06 2.629e+05 33.917 < 2e-16
## Local.Govt.AreaLONGREACH -8.728e+04 8.953e+04 -0.975 0.329631
## Local.Govt.AreaMACKAY 6.274e+06 2.213e+05 28.355 < 2e-16
## Local.Govt.AreaMARANOA 5.808e+05 7.916e+04 7.338 2.43e-13
## Local.Govt.AreaMAREEBA 1.569e+05 7.959e+04 1.971 0.048738
## Local.Govt.AreaMORETON BAY 1.127e+07 4.058e+05 27.779 < 2e-16
## Local.Govt.AreaMOUNT ISA 1.394e+05 9.255e+04 1.507 0.131985
## Local.Govt.AreaMURWEH 3.045e+05 7.863e+04 3.872 0.000109
## Local.Govt.AreaNOOSA 2.023e+06 1.014e+05 19.944 < 2e-16
## Local.Govt.AreaNORTH BURNETT 7.991e+05 7.952e+04 10.050 < 2e-16
## Local.Govt.AreaREDLAND 4.060e+06 1.752e+05 23.175 < 2e-16
## Local.Govt.AreaROCKHAMPTON 6.143e+06 1.680e+05 36.556 < 2e-16
## Local.Govt.AreaSCENIC RIM 2.661e+06 9.186e+04 28.969 < 2e-16
## Local.Govt.AreaSOMERSET 1.424e+06 8.175e+04 17.414 < 2e-16
## Local.Govt.AreaSOUTH BURNETT 2.485e+06 9.413e+04 26.401 < 2e-16
## Local.Govt.AreaSOUTHERN DOWNS 2.004e+06 9.396e+04 21.332 < 2e-16
## Local.Govt.AreaSUNSHINE COAST 1.138e+07 3.931e+05 28.945 < 2e-16
## Local.Govt.AreaTABLELANDS -6.438e+03 7.874e+04 -0.082 0.934835
## Local.Govt.AreaTOOWOOMBA 8.595e+06 2.361e+05 36.411 < 2e-16
## Local.Govt.AreaTORRES 1.678e+05 4.807e+05 0.349 0.727023
## Local.Govt.AreaTOWNSVILLE 7.903e+06 2.260e+05 34.961 < 2e-16
## Local.Govt.AreaWESTERN DOWNS 3.432e+06 1.030e+05 33.304 < 2e-16
## Local.Govt.AreaWHITSUNDAY 2.948e+06 1.011e+05 29.164 < 2e-16
##
## (Intercept) ***
## Operational.EGMs ***
## Operational.Sites ***
## as.numeric(Date) ***
## Local.Govt.AreaBANANA ***
## Local.Govt.AreaBRISBANE ***
## Local.Govt.AreaBUNDABERG ***
## Local.Govt.AreaBURDEKIN **
## Local.Govt.AreaCAIRNS ***
## Local.Govt.AreaCASSOWARY COAST ***
## Local.Govt.AreaCENTRAL HIGHLANDS ***
## Local.Govt.AreaCHARTERS TOWERS ***
## Local.Govt.AreaCLONCURRY
## Local.Govt.AreaDOUGLAS ***
## Local.Govt.AreaFRASER COAST ***
## Local.Govt.AreaGLADSTONE ***
## Local.Govt.AreaGOLD COAST ***
## Local.Govt.AreaGOONDIWINDI ***
## Local.Govt.AreaGYMPIE ***
## Local.Govt.AreaHINCHINBROOK ***
## Local.Govt.AreaIPSWICH ***
## Local.Govt.AreaISAAC ***
## Local.Govt.AreaLIVINGSTONE ***
## Local.Govt.AreaLOCKYER VALLEY ***
## Local.Govt.AreaLOGAN ***
## Local.Govt.AreaLONGREACH
## Local.Govt.AreaMACKAY ***
## Local.Govt.AreaMARANOA ***
## Local.Govt.AreaMAREEBA *
## Local.Govt.AreaMORETON BAY ***
## Local.Govt.AreaMOUNT ISA
## Local.Govt.AreaMURWEH ***
## Local.Govt.AreaNOOSA ***
## Local.Govt.AreaNORTH BURNETT ***
## Local.Govt.AreaREDLAND ***
## Local.Govt.AreaROCKHAMPTON ***
## Local.Govt.AreaSCENIC RIM ***
## Local.Govt.AreaSOMERSET ***
## Local.Govt.AreaSOUTH BURNETT ***
## Local.Govt.AreaSOUTHERN DOWNS ***
## Local.Govt.AreaSUNSHINE COAST ***
## Local.Govt.AreaTABLELANDS
## Local.Govt.AreaTOOWOOMBA ***
## Local.Govt.AreaTORRES
## Local.Govt.AreaTOWNSVILLE ***
## Local.Govt.AreaWESTERN DOWNS ***
## Local.Govt.AreaWHITSUNDAY ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 674500 on 6628 degrees of freedom
## Multiple R-squared: 0.9898, Adjusted R-squared: 0.9897
## F-statistic: 1.397e+04 on 46 and 6628 DF, p-value: < 2.2e-16
plot(dollars_model,
pch=16,
which=1)
hist(dollars_model$residuals,
breaks=1000,
xlim=c(-2e6,2e6))
sd(dollars_model$residuals)
## [1] 672180.9
A hypothesis test (or more precisely the p-value from a hypothesis test) tells you how likely it would be to observe data at least as extreme as yours if the null model (your null hypothesis) were true. It is not the probability that the null hypothesis itself is true.
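One way to build intuition for this: when the null hypothesis really is true, p-values are (approximately) uniformly distributed between 0 and 1, so a very small p-value is surprising under the null. A quick simulation sketch:

```r
set.seed(1)
# 1000 t-tests on samples that truly have mean 0 (so the null is true)
pvals <- replicate(1000, t.test(rnorm(20), mu = 0)$p.value)
# Roughly 5% of the p-values fall below 0.05, by construction
mean(pvals < 0.05)
```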
For example, I may have rolled a red die 10 times and got the following results:
redDie<-c(6, 4, 4, 6, 2, 1, 4, 6, 5, 2)
Let’s say that my “null model” is that the mean of many rolls will be 3.5, i.e., the result we would expect, on average, if the die is fair.
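The 3.5 is just the expected value of a fair six-sided die, which we can compute directly:

```r
# Each face 1..6 is equally likely, so the expected value is their average
mean(1:6)  # 3.5
```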
mean(redDie)
## [1] 4
Is this data consistent with our hypothesis?
redDieTTest<-t.test(redDie, alternative="two.sided", mu = 3.5 ,conf.level=0.95)
redDieTTest
##
## One Sample t-test
##
## data: redDie
## t = 0.86603, df = 9, p-value = 0.409
## alternative hypothesis: true mean is not equal to 3.5
## 95 percent confidence interval:
## 2.693943 5.306057
## sample estimates:
## mean of x
## 4
names(redDieTTest)
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "alternative" "method" "data.name"
redDieTTest$parameter
## df
## 9
redDieTTest$p.value
## [1] 0.4089688
redDieTTest$conf.int
## [1] 2.693943 5.306057
## attr(,"conf.level")
## [1] 0.95
redDieTTest$estimate
## mean of x
## 4
redDieTTest$null.value
## mean
## 3.5
redDieTTest$alternative
## [1] "two.sided"
I then rolled a blue die 10 times and got the following results from it:
blueDie<-c(5, 1, 4, 2, 3, 1, 5, 4, 4, 1)
Are the mean values over ten rolls different between the red and blue dice (within statistical uncertainty)?
redblueDieTTest<-t.test(redDie,blueDie, alternative="two.sided", mu = 0 ,conf.level=0.95)
redblueDieTTest
##
## Welch Two Sample t-test
##
## data: redDie and blueDie
## t = 1.291, df = 17.78, p-value = 0.2132
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.6288085 2.6288085
## sample estimates:
## mean of x mean of y
## 4 3
The p-value (0.2132) is greater than the 5% significance level (1 − 0.95), so we conclude that the null hypothesis, that the red and blue dice have the same mean, is plausible. By contrast, if we draw 10 samples from a normal distribution with mean 9 and test the default null hypothesis that the true mean is 0, the tiny p-value leads us to reject it:
v <- rnorm(10, 9, 6)
t.test(v - 0, alternative="two.sided", conf.level=0.95)
##
## One Sample t-test
##
## data: v - 0
## t = 6.8047, df = 9, p-value = 7.864e-05
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 7.285841 14.542481
## sample estimates:
## mean of x
## 10.91416
Before proceeding with a two-sample t-test, it is necessary to evaluate the sample variances of the two groups, using Fisher's F-test to verify homoskedasticity (homogeneity of variances). In R you can do this as follows:
a <- c(175, 168, 168, 190, 156, 181, 182, 175, 174, 179)
b <- c(185, 169, 173, 173, 188, 186, 175, 174, 179, 180)
var.test(a,b)
##
## F test to compare two variances
##
## data: a and b
## F = 2.1028, num df = 9, denom df = 9, p-value = 0.2834
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.5223017 8.4657950
## sample estimates:
## ratio of variances
## 2.102784
We obtained a p-value greater than 0.05, so we can assume that the two variances are homogeneous. Indeed, we can compare the value of F obtained with the tabulated value of F for alpha = 0.05, with 9 degrees of freedom in the numerator and 9 in the denominator, using the function qf(p, df.num, df.den):
qf(0.95, 9, 9)
## [1] 3.178893
Note that the computed value of F is less than the tabulated value of F, which leads us to accept the null hypothesis of homogeneity of variances. NOTE: the F test here uses only one tail, so with a confidence level of 95% we use p = 0.95. Conversely, the t-distribution has two tails, and in R's function qt(p, df) we insert p = 0.975 when testing a two-tailed alternative hypothesis.
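For instance, the two-tailed 95% critical value of the t distribution with 18 degrees of freedom (as in the two-sample test below) uses p = 0.975:

```r
# Two-tailed 95% critical value of the t distribution with 18 df
qt(0.975, df = 18)
# Compare with the one-tailed F critical value used above
qf(0.95, 9, 9)
```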
t-test to compare the means of two groups, under the assumption that both samples are random, independent, and come from normally distributed populations with unknown but equal variances. To solve this problem we use a two-sample Student's t-test. (If we could not assume the populations follow a Gaussian distribution, we would instead use the non-parametric Wilcoxon-Mann-Whitney test.) For samples with homogeneous variances use var.equal = TRUE; for independent samples use paired = FALSE (you can omit this, because the function works on independent samples by default).
t.test(a,b, var.equal=TRUE, paired=FALSE)
##
## Two Sample t-test
##
## data: a and b
## t = -0.94737, df = 18, p-value = 0.356
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -10.93994 4.13994
## sample estimates:
## mean of x mean of y
## 174.8 178.2
The chi-square test of independence is a non-parametric method appropriate for testing independence between two categorical variables. The dataset used here is survey (from the MASS package): the Smoke column records the students' smoking habits, while the Exer column records their exercise level. The allowed values in Smoke are "Heavy", "Regul" (regularly), "Occas" (occasionally) and "Never". As for Exer, they are "Freq" (frequently), "Some" and "None".
Question: test the hypothesis that the students' smoking habits are independent of their exercise level at the 0.05 significance level (i.e. explore the relationship between the variables).
First, make a contingency table of the two variables (the students' smoking habits against their exercise level) using the table() function in R.
library(MASS) # load the MASS package
tbl <- table(survey$Smoke, survey$Exer)
tbl<-tbl[c(2,3,4,1),] # the contingency table
tbl
##
## Freq None Some
## Never 87 18 84
## Occas 12 3 4
## Regul 9 1 7
## Heavy 7 1 3
See the relationship using barplot:
barplot(tbl,
beside=T,
legend=T,
xlab = "Exercise",
ylab = "Number of People")
Solution: apply the chisq.test function to the contingency table tbl and find the p-value.
chisq.test(tbl)
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 5.4885, df = 6, p-value = 0.4828
Answer: As the p-value 0.4828 is greater than the .05 significance level, we do not reject the null hypothesis that the smoking habit is independent of the exercise level of the students.
Enhanced solution: the warning message in the solution above is due to the small cell values in the contingency table. To avoid the warning, we can combine the second and third columns of tbl, save the result in a new table named ctbl, and then apply the chisq.test function to ctbl instead.
ctbl = cbind(tbl[,"Freq"], tbl[,"None"] + tbl[,"Some"])
ctbl
## [,1] [,2]
## Never 87 102
## Occas 12 7
## Regul 9 8
## Heavy 7 4
EX_SM<- chisq.test(ctbl)
EX_SM
##
## Pearson's Chi-squared test
##
## data: ctbl
## X-squared = 3.2328, df = 3, p-value = 0.3571
attributes(EX_SM)
## $names
## [1] "statistic" "parameter" "p.value" "method" "data.name" "observed"
## [7] "expected" "residuals" "stdres"
##
## $class
## [1] "htest"
If the assumptions of the chi-square test are not met (for example, when expected cell counts are small), we may consider using Fisher's Exact test, a non-parametric alternative to the chi-square test.
fisher.test(tbl, conf.int=T, conf.level=0.99)
##
## Fisher's Exact Test for Count Data
##
## data: tbl
## p-value = 0.4138
## alternative hypothesis: two.sided
ANOVA is a parametric method appropriate for comparing the means of two or more independent populations.
data(InsectSprays)
attach(InsectSprays)
str(InsectSprays)
## 'data.frame': 72 obs. of 2 variables:
## $ count: num 10 7 20 14 14 12 10 23 17 20 ...
## $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
We're going to use a data set called InsectSprays: six different insect sprays (one independent variable with 6 levels) were tested to see if there was a difference in the number of insects found in the field after each spraying (the dependent variable).
InsectSprays
## count spray
## 1 10 A
## 2 7 A
## 3 20 A
## 4 14 A
## 5 14 A
## 6 12 A
## 7 10 A
## 8 23 A
## 9 17 A
## 10 20 A
## 11 14 A
## 12 13 A
## 13 11 B
## 14 17 B
## 15 21 B
## 16 11 B
## 17 16 B
## 18 14 B
## 19 17 B
## 20 17 B
## 21 19 B
## 22 21 B
## 23 7 B
## 24 13 B
## 25 0 C
## 26 1 C
## 27 7 C
## 28 2 C
## 29 3 C
## 30 1 C
## 31 2 C
## 32 1 C
## 33 3 C
## 34 0 C
## 35 1 C
## 36 4 C
## 37 3 D
## 38 5 D
## 39 12 D
## 40 6 D
## 41 4 D
## 42 3 D
## 43 5 D
## 44 5 D
## 45 5 D
## 46 5 D
## 47 2 D
## 48 4 D
## 49 3 E
## 50 5 E
## 51 3 E
## 52 5 E
## 53 3 E
## 54 6 E
## 55 1 E
## 56 1 E
## 57 3 E
## 58 2 E
## 59 6 E
## 60 4 E
## 61 11 F
## 62 9 F
## 63 15 F
## 64 22 F
## 65 15 F
## 66 16 F
## 67 13 F
## 68 10 F
## 69 26 F
## 70 26 F
## 71 24 F
## 72 13 F
mean(count[spray=="A"])
## [1] 14.5
# Look at the mean
tapply(count, spray, mean)
## A B C D E F
## 14.500000 15.333333 2.083333 4.916667 3.500000 16.666667
tapply(count, spray, var)
## A B C D E F
## 22.272727 18.242424 3.901515 6.265152 3.000000 38.606061
tapply(count, spray, length)
## A B C D E F
## 12 12 12 12 12 12
boxplot(count ~ spray)
Run the ANOVA. H0: the mean count is the same for all sprays.
ANOVA.Sprays <- aov(count ~ spray, data=InsectSprays)
summary(ANOVA.Sprays)
## Df Sum Sq Mean Sq F value Pr(>F)
## spray 5 2669 533.8 34.7 <2e-16 ***
## Residuals 66 1015 15.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As the p-value is smaller than 0.05, we reject H0 and conclude that not all of the means are the same.
plot(ANOVA.Sprays)
attributes(ANOVA.Sprays)
## $names
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "contrasts" "xlevels" "call" "terms"
## [13] "model"
##
## $class
## [1] "aov" "lm"
ANOVA.Sprays$coefficients
## (Intercept) sprayB sprayC sprayD sprayE sprayF
## 14.5000000 0.8333333 -12.4166667 -9.5833333 -11.0000000 2.1666667
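These coefficients are relative to the baseline level (spray A): the intercept is the mean count for spray A, and each other coefficient is that spray's difference from A. We can check this directly:

```r
data(InsectSprays)
fit <- aov(count ~ spray, data = InsectSprays)
# Intercept = mean count for the baseline spray A
coef(fit)[["(Intercept)"]]
# Intercept + sprayB offset = mean count for spray B
coef(fit)[["(Intercept)"]] + coef(fit)[["sprayB"]]
mean(InsectSprays$count[InsectSprays$spray == "B"])  # the same value
```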
Run multiple comparisons to determine which means differ from the others. TukeyHSD() computes all possible pairwise comparisons of the group means:
TukeyHSD(ANOVA.Sprays)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = count ~ spray, data = InsectSprays)
##
## $spray
## diff lwr upr p adj
## B-A 0.8333333 -3.866075 5.532742 0.9951810
## C-A -12.4166667 -17.116075 -7.717258 0.0000000
## D-A -9.5833333 -14.282742 -4.883925 0.0000014
## E-A -11.0000000 -15.699409 -6.300591 0.0000000
## F-A 2.1666667 -2.532742 6.866075 0.7542147
## C-B -13.2500000 -17.949409 -8.550591 0.0000000
## D-B -10.4166667 -15.116075 -5.717258 0.0000002
## E-B -11.8333333 -16.532742 -7.133925 0.0000000
## F-B 1.3333333 -3.366075 6.032742 0.9603075
## D-C 2.8333333 -1.866075 7.532742 0.4920707
## E-C 1.4166667 -3.282742 6.116075 0.9488669
## F-C 14.5833333 9.883925 19.282742 0.0000000
## E-D -1.4166667 -6.116075 3.282742 0.9488669
## F-D 11.7500000 7.050591 16.449409 0.0000000
## F-E 13.1666667 8.467258 17.866075 0.0000000
Visual display of the pairwise comparisons of the mean insect counts for the sprays:
plot(TukeyHSD(ANOVA.Sprays), las=1)
The Kruskal-Wallis one-way analysis of variance is a non-parametric equivalent to the one-way analysis of variance:
kruskal.test(count ~ spray, data=InsectSprays)
##
## Kruskal-Wallis rank sum test
##
## data: count by spray
## Kruskal-Wallis chi-squared = 54.691, df = 5, p-value = 1.511e-10
As the p-value is far below 0.05, we again reject the null hypothesis.
Normal Distribution
A common distribution used in several statistical tests is the normal distribution. A standard normal distribution has standard deviation \(\sigma=1\) and mean \(\mu=0\); if you don't provide the mean and sd arguments to the functions below, these values are assumed.
pnorm(q) gives the integrated probability up to q standard deviations away from the mean
pnorm(q = 2)
## [1] 0.9772499
pnorm(q = 0)
## [1] 0.5
qnorm(p) gives the number of standard deviations away from the mean to get an integrated probability of p
qnorm(p = 0.95)
## [1] 1.644854
qnorm(p = 0.025, mean = 2, sd = 0.5)
## [1] 1.020018
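pnorm() and qnorm() are inverses of each other, which is a handy sanity check:

```r
# qnorm() undoes pnorm(): 2 standard deviations in, 2 back out
qnorm(pnorm(2))
# The same holds with a non-default mean and sd
qnorm(pnorm(1.5, mean = 2, sd = 0.5), mean = 2, sd = 0.5)
```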
rnorm(n) gives a vector of n random samples from the standard normal distribution (or you can specify a mean and a standard deviation)
rnorm(n = 4, mean = 2, sd = 2)
## [1] 5.2152317 0.5935958 2.4702616 3.7358675
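The family also includes dnorm(x), the probability density itself; for the standard normal, the peak at the mean is \(1/\sqrt{2\pi}\):

```r
dnorm(0)           # density of the standard normal at its mean
1 / sqrt(2 * pi)   # the same value from the formula
```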
Visual ways of checking the distribution of data are the histogram, the boxplot, and the Q-Q plot.
When A and B have the same distribution, the Q-Q plot is a 45-degree straight line. In many instances we need to know whether a sample comes from a normal distribution; in that case we draw the Q-Q plot of A against the standard normal distribution. NOTE: this will be a straight line if the distribution of A is normal, with any mean and standard deviation. There is an R function that does all of this: qqnorm().
v <- rnorm(100, 3, 2)
qqnorm(v)
qqline(v, col = 2)
The Q-Q line is drawn so that it passes through the first and third quartiles. All of the points should lie close to this line when the sample is normal.
v <- rnorm(100, 0, 1)
qqnorm(v)
qqline(v, col = 2)
v <- rnorm(10000, 0, 1)
qqnorm(v)
qqline(v, col = 2)
Q-Q Plot to Compare Two Samples
v1 <- rnorm(100, 0, 1)
v2 <- rnorm(150, 3, 1)
qqplot(v1, v2)